Abstract

Consider a discrete-time system in which a centralized controller (CC) is tasked with assigning 
at each time interval (or slot) K resources (or servers) to K out of M > K nodes. When assigned a
server, a node can execute a task. The tasks are independently generated at each node by stochastically 
symmetric and memoryless random processes and stored in a finite-capacity task queue. Moreover, they 
are time-sensitive in the sense that within each slot there is a non-zero probability that a task expires 
before being scheduled. The scheduling problem is tackled with the aim of maximizing the number 
of tasks completed over time (or the task-throughput) under the assumption that the CC has no direct 
access to the state of the task queues. The scheduling decisions at the CC are based on the outcomes 
of previous scheduling commands, and on the known statistical properties of the task generation and 
expiration processes. 

Based on a Markovian modeling of the task generation and expiration processes, the CC scheduling 
problem is formulated as a partially observable Markov decision process (POMDP) that can be cast 
into the framework of restless multi-armed bandit (RMAB) problems. When the task queues are of 
capacity one, the optimality of a myopic (or greedy) policy (MP) is proved. It is also demonstrated that the
MP coincides with the Whittle index policy. For task queues of arbitrary capacity instead, the myopic
policy is generally suboptimal, and its performance is compared with an upper bound obtained through 
a relaxation of the original problem. 

Overall, the settings in this paper provide a rare example where a RMAB problem can be explicitly 
solved, and in which the Whittle index policy is proved to be optimal. 
I. Introduction and System Model



The problem of scheduling concurrent tasks under resource constraints finds applications in 
a variety of fields including communication networks [1], distributed computing [2] and virtual
machine scenarios [3]. In this paper we consider a specific instance of this general problem
in which a centralized controller (CC) is tasked with assigning at each time interval (or slot)
K resources, referred to as servers, to K out of M > K nodes as shown in Fig. 1. A server
can complete a single task per slot and can be assigned to one node per time interval. The 
tasks are generated at the M nodes by stochastically symmetric, independent and memoryless 
random processes. The tasks are stored by each node in a finite-capacity task queue, and they 
are time-sensitive in the sense that at each slot there is a non-zero probability that a task expires 
before being completed successfully. It is assumed that the CC has no direct access to the node 
queues, and thus it is not fully informed of their actual states. Instead, the scheduling decision 
is based on the outcomes of previous scheduling commands, and on the statistical knowledge 
of the task generation and expiration processes. If a server is assigned to a node with an empty 
queue, it remains idle for the whole slot. The purpose here is thus to pair servers to nodes so as 
to maximize the average number of successfully completed tasks within either a finite or infinite 
number of slots (horizon), which we refer to as task-throughput, or simply throughput.
Figure 1. The centralized controller (CC) assigns K resources (servers) to K out of M > K nodes to complete their tasks in
each slot t. The tasks of node $U_i$ at slot t are stored in a task queue $Q_i(t)$.
A. Markov Formulation

We now introduce the stochastic model that describes the evolution of the task queues across 
slots. In this section we consider task queues of capacity one (see Sec. V for capacity larger
than one), where $Q_i(t) \in \{0, 1\}$ denotes the number of tasks in the queue of node $U_i$, for
$i \in \{1, \ldots, M\}$. The stochastic evolution of queue $Q_i(t)$ is shown in Fig. 2 as a function of the
scheduling decision $\mathcal{U}(t)$, which consists in the assignment at each slot t of the K servers to a
subset $\mathcal{U}(t) \subseteq \{U_1, \ldots, U_M\}$ of K nodes, with $|\mathcal{U}(t)| = K$.

Figure 2. Markov model for the evolution of the state of the task queue $Q_i(t) \in \{0, 1\}$, when the node $U_i$: a) is not scheduled
in slot t (i.e., $U_i \notin \mathcal{U}(t)$); b) is scheduled in slot t (i.e., $U_i \in \mathcal{U}(t)$).

At each slot, node $U_i$ can be either scheduled ($U_i \in \mathcal{U}(t)$) or not ($U_i \notin \mathcal{U}(t)$). If $U_i$ is not
scheduled (i.e., $U_i \notin \mathcal{U}(t)$, see Fig. 2-a)) and there is a task in its queue (i.e., $Q_i(t) = 1$), then
the task expires with probability (w.p.) $p_{10}^{(0)} = \Pr[Q_i(t+1) = 0 \mid Q_i(t) = 1, U_i \notin \mathcal{U}(t)]$, while it
remains in the queue w.p. $p_{11}^{(0)} = 1 - p_{10}^{(0)}$. Instead, if node $U_i$ is scheduled (i.e., $U_i \in \mathcal{U}(t)$, see
Fig. 2-b)) and $Q_i(t) = 1$, its task is completed successfully and its queue in the next slot is either
empty or full w.p. $p_{10}^{(1)} = \Pr[Q_i(t+1) = 0 \mid Q_i(t) = 1, U_i \in \mathcal{U}(t)]$ and $p_{11}^{(1)} = 1 - p_{10}^{(1)}$, respectively.
Probability $p_{11}^{(1)}$ accounts for the possible arrival of a new task. If $Q_i(t) = 0$, the probabilities
of receiving a new task when $U_i$ is not scheduled and scheduled are $p_{01}^{(0)} = \Pr[Q_i(t+1) = 1 \mid Q_i(t) = 0, U_i \notin \mathcal{U}(t)]$
and $p_{01}^{(1)} = \Pr[Q_i(t+1) = 1 \mid Q_i(t) = 0, U_i \in \mathcal{U}(t)]$, respectively, while
the probabilities of receiving no task are $p_{00}^{(0)} = 1 - p_{01}^{(0)}$ and $p_{00}^{(1)} = 1 - p_{01}^{(1)}$, respectively.
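
As an illustration of the two-state model just described, the sketch below collects the four free parameters of a capacity-one queue and builds the corresponding 2x2 transition matrices. The class name `QueueModel`, its field names, and the numerical values in the usage note are hypothetical choices made for this example only; they do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueModel:
    """Transition probabilities p_xy^(u) of a capacity-one task queue.

    Field names are illustrative: "idle" means u = 0 (node not scheduled),
    "sched" means u = 1 (node scheduled).
    """
    p01_idle: float   # p_01^(0): empty -> full, node not scheduled
    p11_idle: float   # p_11^(0): full  -> full, node not scheduled
    p01_sched: float  # p_01^(1): empty -> full, node scheduled
    p11_sched: float  # p_11^(1): full  -> full, node scheduled

    def matrix(self, scheduled: bool):
        """Return [[p00, p01], [p10, p11]]; entry [x][y] = Pr[Q(t+1)=y | Q(t)=x]."""
        p01 = self.p01_sched if scheduled else self.p01_idle
        p11 = self.p11_sched if scheduled else self.p11_idle
        return [[1.0 - p01, p01], [1.0 - p11, p11]]
```

For instance, `QueueModel(0.3, 0.9, 0.2, 0.1).matrix(False)[1][0]` is the expiration probability $p_{10}^{(0)}$ of a task in a non-scheduled node under these example values.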

B. Related Work and Contributions 

In this work we assume that the CC has no direct access to the state of the task queues
$Q_1(t), \ldots, Q_M(t)$, while it knows the transition probabilities $p_{xy}^{(u)}$, with $x, y, u \in \{0, 1\}$, and
the outcomes of previously scheduled tasks. The scheduling problem is thus formalized as a
partially observable Markov decision process (POMDP) [4], and then cast into a restless multi-armed
bandit (RMAB) problem [5]. A RMAB is constituted by a set of arms (the queues in our
model), a subset of which needs to be activated (or scheduled) in each slot by the controller. 

To elaborate, we assume that the transition probabilities of the Markov chains in Fig. 2, the
number of nodes M and servers K are such that

$$m = M/K \ \text{is an integer, and} \tag{1a}$$

$$p_{11}^{(1)} \le p_{01}^{(1)} \le p_{01}^{(0)} \le p_{11}^{(0)}. \tag{1b}$$

Assumption (1a) states that the ratio m = M/K between the numbers M of nodes and K of
servers is an integer, generalizing the single-server case (K = 1). Proving the results provided
later in this paper for the case of non-integer m remains an open problem. Assumption (1b) is
motivated as follows. The inequality $p_{11}^{(1)} \le p_{01}^{(1)}$ imposes that the probability that a new task
arrives when the task queue is full and the node is scheduled ($p_{11}^{(1)}$) is no larger than when the
task queue is empty ($p_{01}^{(1)}$). This applies, e.g., when the arrival of a new task is independent of
the queue's state and scheduling decisions (i.e., $p_{11}^{(1)} = p_{01}^{(1)}$), or when a new task is not accepted
when the queue is full, i.e., $p_{11}^{(1)} = 0$. Inequality $p_{01}^{(1)} \le p_{01}^{(0)}$ applies, e.g., when the task generation
process does not depend on the queue's state and on the scheduling decisions, so that $p_{01}^{(1)} = p_{01}^{(0)}$,
or when a new task cannot be accepted while the node is scheduled even if the queue is empty
($p_{01}^{(1)} = 0$). Inequality $p_{01}^{(0)} \le p_{11}^{(0)}$ indicates that, when a node is not scheduled, the probability
$p_{01}^{(0)}$ that its task queue is full in the next slot, given that it is currently empty, is no larger than
the probability $p_{11}^{(0)}$ that the task queue is full in the next slot given that it is currently full. This
applies, e.g., when the task generation and expiration processes are independent of each other.
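
The two assumptions can be checked mechanically for a candidate parameter set. The helper below is a minimal sketch; the function name and argument names are hypothetical and only mirror the symbols of (1a) and (1b).

```python
def satisfies_assumptions(M, K, p11_sched, p01_sched, p01_idle, p11_idle):
    """Return True when both (1a) and (1b) hold.

    Argument names are illustrative: "sched" stands for the superscript (1)
    and "idle" for the superscript (0) of the transition probabilities.
    """
    # (1a): the ratio m = M / K must be an integer.
    ratio_is_integer = K > 0 and M % K == 0
    # (1b): p11^(1) <= p01^(1) <= p01^(0) <= p11^(0).
    ordered = p11_sched <= p01_sched <= p01_idle <= p11_idle
    return ratio_is_integer and ordered
```

For example, a queue that never accepts a new task while full and scheduled ($p_{11}^{(1)} = 0$) and whose arrivals do not depend on scheduling ($p_{01}^{(1)} = p_{01}^{(0)}$) passes the check whenever $p_{01}^{(0)} \le p_{11}^{(0)}$ and K divides M.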

Main Contributions: When the task queues are of capacity one, and under assumptions (1),
we first show that the myopic policy (MP) for the RMAB at hand is a round robin (RR) strategy
that: i) re-numbers the nodes in a decreasing order according to the initial probability that their
respective task queue is full; and then ii) schedules the nodes periodically in groups of K by
exploiting the initial ordering. The MP is then proved to be throughput-optimal. We then show
that, for the special case in which $p_{01}^{(0)} = p_{01}^{(1)}$ and $p_{11}^{(1)} = p_{10}^{(0)} = 0$, the MP coincides with the
Whittle index policy, which is a generally suboptimal index strategy for RMAB problems [6].
Finally, we extend the model of Sec. I-A to queues with an arbitrary capacity C. Characterizing
optimal policies for C > 1 is significantly more complicated than the case of C = 1. Hence,
inspired by the optimality of the MP for C = 1, we compare the performance of the MP for
C > 1 with an upper bound based on a relaxation of the scheduling constraints of the original
RMAB problem [6]. It is recalled that the results in this paper represent a rare case in which
the optimal policy for a RMAB can be found explicitly [5].

Related Work: The work in this paper is related to the works [7], [8], in which a RMAB
problem similar to the one in this paper is addressed. However, the main difference between
our RMAB and the one in [7], [8] is the evolution of the arms across slots. In particular, in
our RMAB, each arm evolves across a slot depending on the scheduling decision taken by the
controller, while in [7], [8] the evolution of the arms does not depend on the scheduling decision.
The transition probabilities for the RMAB in [7], [8] are thus equivalent to setting $p_{01}^{(0)} = p_{01}^{(1)}$
and $p_{11}^{(0)} = p_{11}^{(1)}$ in the Markov chains of Fig. 2. For instance, our model applies to scenarios in
which the arms are, e.g., data queues, where each arm draws a data packet from its queue only
when scheduled. Instead, the model in [7], [8] applies to scenarios in which the arms are, e.g.,
communication channels, whose quality evolves across slots regardless of whether they are selected
for transmission or not.

In [7] it is shown that the MP is optimal for $p_{01}^{(0)} = p_{01}^{(1)} \le p_{11}^{(0)} = p_{11}^{(1)}$ with K = 1, while
[8] extends this result to an arbitrary K. The work [7] also demonstrates that the MP is not
generally optimal in the case $p_{01}^{(0)} = p_{01}^{(1)} > p_{11}^{(0)} = p_{11}^{(1)}$. Finally, paper [9] proves the optimality
of the Whittle index policy for $p_{01}^{(0)} = p_{01}^{(1)} \le p_{11}^{(0)} = p_{11}^{(1)}$. We emphasize that neither our model
nor the one considered in [7], [8] subsumes the other, and the results here and in the mentioned
previous works should be considered as complementary.
Notation: Vectors are denoted in bold, while the corresponding non-bold letters denote the
vector components. Given a vector $\mathbf{x} = [x_1, \ldots, x_M]$ and a set $S = \{i_1, \ldots, i_K\} \subseteq \{1, \ldots, M\}$ of
cardinality $K \le M$, we define vector $\mathbf{x}_S = [x_{i_1}, \ldots, x_{i_K}]$, where $i_1 \le \ldots \le i_K$. A function $f(\mathbf{x})$
of vector $\mathbf{x}$ is also denoted as $f(x_1, \ldots, x_M)$ or as $f(x_1, \ldots, x_l, \mathbf{x}_{\{l+1, \ldots, M\}})$ for some $1 \le l \le M$,
or similar notations depending on the context. Given a set $A$ and a subset $B \subseteq A$, $B^c$ represents
the complement of $B$ in $A$.

II. Problem Formulation 

Here we formalize the scheduling problem of Sec. I (see Fig. 1), in which the task generation
and expiration processes are modeled, independently at each node, by the Markov models of
Sec. I-A with queues of capacity one. Extension to task queues of arbitrary capacity is addressed
in Sec. V.

A. Problem Definition 

The scheduling problem at the CC is addressed in a finite-horizon scenario over slots $t \in
\{1, \ldots, T\}$. Let $\mathbf{Q}(t) = [Q_1(t), \ldots, Q_M(t)]$ be the vector collecting the states of the task queues
at slot t. At slot t = 1, the CC is only aware of the initial probability distribution $\boldsymbol{\omega}(1) =
[\omega_1(1), \ldots, \omega_M(1)]$ of $\mathbf{Q}(1)$, whose ith entry is $\omega_i(1) = \Pr[Q_i(1) = 1]$. Thus, the subset $\mathcal{U}(1)$
of $|\mathcal{U}(1)| = K$ nodes scheduled at slot t = 1 is chosen as a function of the initial distribution
$\boldsymbol{\omega}(1)$ only. For any node $U_i \in \mathcal{U}(t)$ scheduled at slot t, an observation is made available to
the CC at the end of the slot, while no observations are available for non-scheduled nodes
$U_i \notin \mathcal{U}(t)$. Specifically, if $Q_i(t) = 1$ and $U_i \in \mathcal{U}(t)$, the task of $U_i$ is served within slot t,
and the CC observes that $Q_i(t) = 1$. Conversely, if $Q_i(t) = 0$ and $U_i \in \mathcal{U}(t)$, no tasks are
completed and the CC observes that $Q_i(t) = 0$. We define $\mathcal{O}(t) = \{Q_i(t) : U_i \in \mathcal{U}(t)\}$ as the
set of (new) observations available at the CC at the end of slot t. At time t, the CC hence knows
the history of all decisions and previous observations and the initial distribution $\boldsymbol{\omega}(1)$, namely
$\mathcal{H}(t) = \{\boldsymbol{\omega}(1), \mathcal{U}(1), \ldots, \mathcal{U}(t-1), \mathcal{O}(1), \ldots, \mathcal{O}(t-1)\}$, with $\mathcal{H}(1) = \{\boldsymbol{\omega}(1)\}$.






Since the CC has only partial information about the system state $\mathbf{Q}(t)$, through $\mathcal{O}(t)$, the
scheduling problem at hand can be modeled as a POMDP. It is well known that a sufficient
statistic for taking decisions in such a POMDP is given by the probability distribution of $\mathbf{Q}(t)$
conditioned on the history $\mathcal{H}(t)$ [10], referred to as belief, and represented by the vector $\boldsymbol{\omega}(t) =
[\omega_1(t), \ldots, \omega_M(t)]$, with ith entry given by

$$\omega_i(t) = \Pr[Q_i(t) = 1 \mid \mathcal{H}(t)]. \tag{2}$$

Since the belief $\boldsymbol{\omega}(t)$ fully summarizes the entire history $\mathcal{H}(t)$ of past actions and observations
[10], a scheduling policy $\pi = [\mathcal{U}^{\pi}(1), \ldots, \mathcal{U}^{\pi}(T)]$ is defined as a collection of functions $\mathcal{U}^{\pi}(t)$ that
map the belief $\boldsymbol{\omega}(t)$ to a subset $\mathcal{U}(t)$ of $|\mathcal{U}(t)| = K$ nodes, i.e., $\mathcal{U}^{\pi}(t): \boldsymbol{\omega}(t) \to \mathcal{U}(t)$. We will
refer to $\mathcal{U}^{\pi}(t)$ as the subset of scheduled nodes, even though, strictly speaking, it is the mapping
function defined above. The transition probabilities over the belief space are derived in Sec. II-B.
The immediate reward $R(\boldsymbol{\omega}, \mathcal{U})$, accrued by the CC when the belief vector is $\boldsymbol{\omega}$ and action
$\mathcal{U}$ is taken, measures the average number of tasks completed within the current slot, and it is

$$R(\boldsymbol{\omega}, \mathcal{U}) = \sum_{i:\, U_i \in \mathcal{U}} \omega_i. \tag{3}$$

Notice that $R(\boldsymbol{\omega}, \mathcal{U}) \le K$ since there are only K servers.

The throughput measures the average number of tasks completed over the slots $\{1, \ldots, T\}$ that,
by exploiting (3) and under policy $\pi$, is given by

$$V_1^{\pi}(\boldsymbol{\omega}(1)) = \sum_{t=1}^{T} \beta^{t-1}\, \mathrm{E}^{\pi}\!\left[ R(\boldsymbol{\omega}(t), \mathcal{U}^{\pi}(t)) \mid \boldsymbol{\omega}(1) \right]. \tag{4}$$

In (4), the expectation $\mathrm{E}^{\pi}[\cdot \mid \boldsymbol{\omega}(1)]$, under policy $\pi$ for initial belief $\boldsymbol{\omega}(1)$, is intended with respect
to the distribution of the Markov process $\boldsymbol{\omega}(t)$, as obtained from the transition probabilities to
be derived in Sec. II-B. For generality, the definition (4) includes a discount factor $0 \le \beta \le 1$
[7], while the infinite-horizon scenario (i.e., $T \to \infty$) will be discussed in Sec. III-C.

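The goal is to find a policy $\pi^* = [\mathcal{U}^*(1), \ldots, \mathcal{U}^*(T)]$ that maximizes the throughput (4) so that

$$V_1^*(\boldsymbol{\omega}(1)) = V_1^{\pi^*}(\boldsymbol{\omega}(1)) = \max_{\pi} V_1^{\pi}(\boldsymbol{\omega}(1)), \quad \text{with } \pi^* = \arg\max_{\pi} V_1^{\pi}(\boldsymbol{\omega}(1)). \tag{5}$$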



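B. Transition Probabilities

The belief transition probabilities, given decision $\mathcal{U}(t) = \mathcal{U}$ and $\boldsymbol{\omega}(t) = \boldsymbol{\omega} = [\omega_1, \ldots, \omega_M]$, are

$$\Pr[\boldsymbol{\omega}' \mid \boldsymbol{\omega}, \mathcal{U}] = \prod_{i=1}^{M} \Pr[\omega_i' \mid \omega_i, \mathcal{U}], \tag{6}$$

where $\boldsymbol{\omega}(t+1) = \boldsymbol{\omega}' = [\omega_1', \ldots, \omega_M']$, while the distribution of entry $\omega_i(t+1)$ is (see Fig. 2)

$$\Pr[\omega_i(t+1) = \omega_i' \mid \omega_i(t) = \omega_i, \mathcal{U}(t) = \mathcal{U}] =
\begin{cases}
\omega_i & \text{if } \omega_i' = p_{11}^{(1)} \text{ and } U_i \in \mathcal{U} \\
(1 - \omega_i) & \text{if } \omega_i' = p_{01}^{(1)} \text{ and } U_i \in \mathcal{U} \\
1 & \text{if } \omega_i' = \tau_0^{(1)}(\omega_i) \text{ and } U_i \notin \mathcal{U}
\end{cases}, \tag{7}$$

where we have defined the deterministic function

$$\tau_0^{(1)}(\omega) = \Pr[Q_i(t+1) = 1 \mid \omega_i(t) = \omega, U_i \notin \mathcal{U}(t)] = \omega p_{11}^{(0)} + (1 - \omega) p_{01}^{(0)} = \omega \delta_0 + p_{01}^{(0)} \tag{8}$$

to indicate the next slot's belief when $U_i$ is not scheduled ($U_i \notin \mathcal{U}(t)$), with $\delta_0 = p_{11}^{(0)} - p_{01}^{(0)} \ge 0$
due to inequalities (1b). Eq. (8) follows from Fig. 2-a), since the next slot's belief is either $p_{11}^{(0)}$
if $Q_i(t) = 1$ (w.p. $\omega$) or $p_{01}^{(0)}$ if $Q_i(t) = 0$ (w.p. $(1 - \omega)$). A generalization of function $\tau_0^{(1)}(\omega)$
that computes the belief $\omega_i(t+k)$ of node $U_i$ when it is not scheduled for k successive slots,
e.g., slots $\{t, \ldots, t+k-1\}$, and $\omega_i(t) = \omega$, can be obtained as

$$\tau_0^{(k)}(\omega) = \Pr[Q_i(t+k) = 1 \mid \omega_i(t) = \omega, U_i \notin \mathcal{U}(t), \ldots, U_i \notin \mathcal{U}(t+k-1)] = \omega \delta_0^k + p_{01}^{(0)} \frac{1 - \delta_0^k}{1 - \delta_0}. \tag{9}$$

Eq. (9) can be obtained recursively from (8) as $\tau_0^{(k)}(\omega) = \tau_0^{(1)}(\tau_0^{(k-1)}(\omega))$, for $k \ge 1$, with
$\tau_0^{(0)}(\omega) = \omega$.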
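
The passive belief update and its k-step closed form can be compared numerically. The following is a minimal sketch; the function names are hypothetical, and the special case $\delta_0 = 1$ (where the closed form would divide by zero) is handled separately as an assumption of this example.

```python
def tau_step(omega, p01_idle, p11_idle):
    """One passive slot, eq. (8): belief that the queue is full in the next slot."""
    return omega * p11_idle + (1.0 - omega) * p01_idle


def tau_k(omega, k, p01_idle, p11_idle):
    """k successive passive slots, closed form (9)."""
    delta0 = p11_idle - p01_idle
    if delta0 == 1.0:
        # p01^(0) = 0 and p11^(0) = 1: a passive queue never changes state,
        # so the belief stays where it is (case not covered by the closed form).
        return omega
    return omega * delta0 ** k + p01_idle * (1.0 - delta0 ** k) / (1.0 - delta0)
```

Iterating `tau_step` k times from a belief `omega` should agree with `tau_k(omega, k, ...)` up to floating-point error, which is the recursion stated after (9).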
Under assumptions (1b), it is easy to verify that function (8) satisfies the inequalities

$$p_{11}^{(1)} \le p_{01}^{(1)} \le \tau_0^{(1)}(\omega), \quad \text{for all } 0 \le \omega \le 1, \text{ and} \tag{10}$$

$$\tau_0^{(1)}(\omega) \le \tau_0^{(1)}(\omega'), \quad \text{for all } \omega \le \omega' \text{ with } 0 \le \omega, \omega' \le 1. \tag{11}$$

The inequalities in (10) guarantee that the belief of a non-scheduled node is never smaller than
that of a scheduled one, while the inequality (11) says that the belief ordering of two non-
scheduled nodes is maintained across a slot. These inequalities play a crucial role in the analysis
below.
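
For completeness, a short verification of these two inequalities is sketched here, using only definition (8), $\delta_0 \ge 0$, and assumption (1b):

```latex
\begin{align*}
\tau_0^{(1)}(\omega) &= \omega\,\delta_0 + p_{01}^{(0)}
  \;\ge\; p_{01}^{(0)} \;\ge\; p_{01}^{(1)} \;\ge\; p_{11}^{(1)},
  && 0 \le \omega \le 1, \\
\tau_0^{(1)}(\omega') - \tau_0^{(1)}(\omega) &= \delta_0\,(\omega' - \omega) \;\ge\; 0,
  && \omega \le \omega' .
\end{align*}
```

The first line uses $\omega\,\delta_0 \ge 0$ followed by the ordering in (1b); the second uses the linearity of $\tau_0^{(1)}$ in $\omega$.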

C. Optimality Equations 

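The dynamic programming (DP) formulation of problem (5) (see, e.g., [11]) allows the throughput
to be expressed recursively over the horizon $\{t, \ldots, T\}$, under policy $\pi$ and initial belief $\boldsymbol{\omega}$, as

$$V_t^{\pi}(\boldsymbol{\omega}) = \sum_{j=t}^{T} \beta^{j-t}\, \mathrm{E}^{\pi}\!\left[ R(\boldsymbol{\omega}(j), \mathcal{U}^{\pi}(j)) \mid \boldsymbol{\omega}(t) = \boldsymbol{\omega} \right] = R(\boldsymbol{\omega}, \mathcal{U}^{\pi}(t)) + \beta \sum_{\boldsymbol{\omega}'} \Pr[\boldsymbol{\omega}' \mid \boldsymbol{\omega}, \mathcal{U}^{\pi}(t)]\, V_{t+1}^{\pi}(\boldsymbol{\omega}'), \tag{12}$$

where $V_t^{\pi}(\cdot) = 0$ for $t > T$. The DP optimality conditions (or Bellman equations) are then
expressed in terms of the value function $V_t^*(\boldsymbol{\omega}) = \max_{\pi} V_t^{\pi}(\boldsymbol{\omega})$, which represents the optimal
throughput in the interval $\{t, \ldots, T\}$, and it is given by

$$V_t^*(\boldsymbol{\omega}) = \max_{\mathcal{U} \subseteq \{U_1, \ldots, U_M\},\ |\mathcal{U}| = K} \left\{ R(\boldsymbol{\omega}, \mathcal{U}) + \beta \sum_{\boldsymbol{\omega}'} \Pr[\boldsymbol{\omega}' \mid \boldsymbol{\omega}, \mathcal{U}]\, V_{t+1}^*(\boldsymbol{\omega}') \right\}. \tag{13}$$

Note that, since the nodes are stochastically equivalent, the value function (13) only depends on
the numerical values of the entries of the belief vector $\boldsymbol{\omega}$ regardless of the way it is ordered.
Finally, an optimal policy $\pi^* = [\mathcal{U}^*(1), \ldots, \mathcal{U}^*(T)]$ (see (5)) is such that $\mathcal{U}^*(t)$ attains the
maximum in the condition (13) for $t \in \{1, 2, \ldots, T\}$.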



III. Optimality of the Myopic Policy

We now define the myopic policy (MP) and show that, under assumptions (1), it is a round-
robin (RR) policy that schedules the nodes periodically and that it is optimal for problem (5).
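A. The Myopic Policy is Round-Robin

The MP $\pi^{MP} = \{\mathcal{U}^{MP}(1), \ldots, \mathcal{U}^{MP}(T)\}$, with throughput $V_t^{MP}(\boldsymbol{\omega})$, is the greedy policy that
schedules at each slot the K nodes with the largest beliefs so as to maximize the immediate
reward (3), that is, we have

$$\mathcal{U}^{MP}(t) = \arg\max_{\mathcal{U}} R(\boldsymbol{\omega}(t), \mathcal{U}) = \arg\max_{\mathcal{U}} \sum_{i:\, U_i \in \mathcal{U}} \omega_i(t). \tag{14}$$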



Proposition 1. Under assumptions (1), the MP (14), given an initial belief, is a RR
policy that operates as follows: 1) Sort the initial belief vector in a decreasing order to obtain
$\boldsymbol{\omega}(1) = [\omega_1(1), \ldots, \omega_M(1)]$ such that $\omega_1(1) \ge \ldots \ge \omega_M(1)$. Re-number the nodes so that $U_i$ has belief
$\omega_i(1)$; 2) Divide the nodes into m groups of K nodes each, so that the gth group $\mathcal{G}_g$, $g \in
\{1, \ldots, m\}$, contains all nodes $U_i$ such that $g = \lfloor (i-1)/K \rfloor + 1$, namely: $\mathcal{G}_1 = \{U_1, \ldots, U_K\}$, $\mathcal{G}_2 =
\{U_{K+1}, \ldots, U_{2K}\}$, and so on; 3) Schedule the groups in a RR fashion with period m slots, so
that groups $\mathcal{G}_1, \ldots, \mathcal{G}_m, \mathcal{G}_1, \ldots$ are sequentially scheduled at slot $t = 1, \ldots, m, m+1, \ldots$ and so on.

Proof: According to (14), the first scheduled set is $\mathcal{U}^{MP}(1) = \mathcal{G}_1 = \{U_1, U_2, \ldots, U_K\}$. The
beliefs are then updated through (7). Recalling (10), the scheduled nodes, in $\mathcal{G}_1$, have their belief
updated to either $p_{11}^{(1)}$ or $p_{01}^{(1)}$, which are both no larger than the belief of any non-scheduled node
in $\{U_1, \ldots, U_M\} \setminus \mathcal{G}_1$. Moreover, the ordering of the non-scheduled nodes' beliefs is preserved
due to (11). Hence, the second scheduled group is $\mathcal{U}^{MP}(2) = \mathcal{G}_2$, the third is $\mathcal{U}^{MP}(3) = \mathcal{G}_3$,
and so on. This proves that the MP, upon an initial ordering of the beliefs, is a RR policy. ■
We emphasize that the MP sorts the beliefs of the nodes only at the first slot in which it is
operated, and then it keeps scheduling the groups of nodes according to their initial ordering,
without requiring to recalculate the beliefs.
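
The round-robin structure can also be observed in simulation. The sketch below tracks beliefs explicitly, applies the greedy rule at every slot, and checks that the selected set coincides with the round-robin group. The function name, the dictionary keys, and the numerical values in the usage note are hypothetical; the example assumes distinct initial beliefs and strict inequalities in (1b) so that the K largest beliefs are never tied across groups.

```python
import random


def myopic_matches_round_robin(initial_beliefs, K, probs, T, seed=0):
    """Simulate T slots; return True if the myopic set equals the RR group in every slot."""
    rng = random.Random(seed)
    M = len(initial_beliefs)
    m = M // K  # number of groups, assumed integer as in (1a)
    # Step 1: re-number the nodes by decreasing initial belief.
    omega = sorted(initial_beliefs, reverse=True)
    # Hidden queue states drawn from the initial beliefs (never shown to the CC).
    queue = [1 if rng.random() < w else 0 for w in omega]
    for t in range(T):
        # Myopic choice: the K nodes with the largest current beliefs.
        ranked = sorted(range(M), key=lambda i: omega[i], reverse=True)
        scheduled = set(ranked[:K])
        # Round-robin choice: group g = t mod m holds nodes gK, ..., (g+1)K - 1.
        g = t % m
        if scheduled != set(range(g * K, (g + 1) * K)):
            return False
        for i in range(M):
            if i in scheduled:
                # The CC observes Q_i(t), so the next belief is known exactly.
                omega[i] = probs["p11_sched"] if queue[i] else probs["p01_sched"]
                p_full = omega[i]
            else:
                # Hidden transition of the queue, and passive belief update (8).
                p_full = probs["p11_idle"] if queue[i] else probs["p01_idle"]
                omega[i] = omega[i] * probs["p11_idle"] + (1 - omega[i]) * probs["p01_idle"]
            queue[i] = 1 if rng.random() < p_full else 0
    return True
```

With, e.g., `probs = {"p11_sched": 0.1, "p01_sched": 0.2, "p01_idle": 0.3, "p11_idle": 0.9}`, six nodes and K = 2, the function is expected to return `True` for any random seed, since the scheduled sets do not depend on the observed outcomes.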

B. Optimality of the Myopic Policy 

We now prove the optimality of the MP by showing that it satisfies the Bellman equations
(13). To start with, let us consider a RR policy $\pi^{RR}$ that operates according to steps 2) and
3) of Proposition 1 (i.e., without re-ordering the initial belief), and let its throughput (12) be
denoted by $V_t^{RR}(\boldsymbol{\omega})$. Note that, when the initial belief $\boldsymbol{\omega}$ is ordered so that $\omega_1 \ge \ldots \ge \omega_M$, then
$V_t^{RR}(\boldsymbol{\omega}) = V_t^{MP}(\boldsymbol{\omega})$. Based on backward induction arguments similar to [7], [8], the following
lemma establishes a sufficient condition for the optimality of the MP.

Lemma 2. Assume that the MP is optimal at slots $t+1, \ldots, T$, i.e., it satisfies (13). To show that
the MP is optimal also at slot t it is sufficient to prove the inequality

$$V_t^{RR}(\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c}) \le V_t^{MP}(\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c}) = V_t^{RR}(\omega_1, \omega_2, \ldots, \omega_M), \quad \text{for all } \omega_1 \ge \omega_2 \ge \ldots \ge \omega_M \tag{15}$$

and all sets $S \subseteq \{1, \ldots, M\}$ of K elements, with the elements in $\boldsymbol{\omega}_{S^c}$ decreasingly ordered.

Proof: Since the MP is optimal from t + 1 onward by assumption, it is sufficient to show
that scheduling K nodes with arbitrary beliefs at slot t and then following the MP from slot
t + 1 on is no better than following the MP immediately at slot t. The performance of the
former policy is given by the left-hand side (LHS) of (15). In fact $V_t^{RR}(\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c})$, for any set S,
represents the throughput of a policy that schedules the K nodes with beliefs $\boldsymbol{\omega}_S$ at slot t, and
then operates as the MP from t + 1 onward, since beliefs in $\boldsymbol{\omega}_{S^c}$ are decreasingly ordered. The
MP's performance is instead given by the right-hand side (RHS) of (15). Note that, for t = T,
it is immediate to verify that the MP is optimal. This concludes the proof. ■

Theorem 3. Under assumptions (1), the MP is optimal for problem (5), so that $\pi^{MP} = \pi^*$.

Proof: To start with, we first prove in Appendix A that the inequality

$$V_t^{RR}(\omega_1, \ldots, \omega_j, y, x, \ldots, \omega_M) \le V_t^{RR}(\omega_1, \ldots, \omega_j, x, y, \ldots, \omega_M) \tag{16}$$

holds for any $x \ge y$, with $0 \le j \le M-2$, and for all $t \in \{1, \ldots, T\}$ and beliefs $\omega_k$ (not necessarily
ordered), with $k \in \{1, \ldots, M\}$. Inequality (16) for j = 0 is intended as $V_t^{RR}(y, x, \ldots, \omega_M) \le
V_t^{RR}(x, y, \ldots, \omega_M)$. If (16) holds, then inequality (15) is satisfied for all $\omega_1 \ge \ldots \ge \omega_M$ and all
subsets $S \subseteq \{1, \ldots, M\}$ of K elements. In fact, (16) states that the throughput of the RR policy
never increases when, for any pair of adjacent nodes, the one with the smallest belief of the
pair is scheduled first. Hence, by starting from the RHS of (15) (i.e., $V_t^{RR}(\omega_1, \omega_2, \ldots, \omega_M)$) and
by applying a convenient number of successive switchings between pairs of adjacent elements of
vector $[\omega_1, \omega_2, \ldots, \omega_M]$ to achieve $[\boldsymbol{\omega}_S, \boldsymbol{\omega}_{S^c}]$, for any S, we can obtain a cascade of inequalities
through (16) (one for each switching), which guarantees that (15) holds. By Lemma 2 this is
sufficient to prove that the MP is optimal, since the inequality (15) holds for any arbitrary t. ■
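
On small instances the statement of Theorem 3 can be checked by exhaustive dynamic programming over the belief tree. The sketch below is only a numerical sanity check, not the proof technique used above: it evaluates the Bellman recursion for all scheduling sets and compares the result with the throughput of the greedy rule. The transition probabilities are hypothetical values chosen to satisfy (1b), and all function names are illustrative.

```python
from functools import lru_cache
from itertools import combinations, product

# Hypothetical transition probabilities satisfying (1b): 0.1 <= 0.2 <= 0.3 <= 0.9.
P11_SCHED, P01_SCHED, P01_IDLE, P11_IDLE = 0.1, 0.2, 0.3, 0.9


def successors(omega, action):
    """Yield (probability, next belief vector) pairs for a scheduled set of node indices."""
    passive = [w * P11_IDLE + (1 - w) * P01_IDLE for w in omega]  # passive update (8)
    for outcome in product((0, 1), repeat=len(action)):
        prob, nxt = 1.0, list(passive)
        for i, q in zip(action, outcome):
            prob *= omega[i] if q else 1 - omega[i]   # observation Q_i(t) = q
            nxt[i] = P11_SCHED if q else P01_SCHED    # belief of a scheduled node
        yield prob, tuple(nxt)


@lru_cache(maxsize=None)
def value(omega, t, T, beta, K, myopic):
    """Throughput over slots t..T: optimal (myopic=False) or under the greedy rule."""
    def q_value(action):
        reward = sum(omega[i] for i in action)        # immediate reward
        if t == T:
            return reward
        future = sum(p * value(nxt, t + 1, T, beta, K, myopic)
                     for p, nxt in successors(omega, action))
        return reward + beta * future

    actions = list(combinations(range(len(omega)), K))
    if myopic:
        return q_value(max(actions, key=lambda a: sum(omega[i] for i in a)))
    return max(q_value(a) for a in actions)
```

For instance, `value((0.8, 0.5, 0.25), 1, 4, 0.9, 1, False)` and the same call with `myopic=True` are expected to agree up to floating-point error when M/K is an integer.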

C. Extension to the Infinite-Horizon Case 

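We now briefly describe the extension of problem (5) to the infinite-horizon case, for which
the throughput under policy $\pi$ and its optimal value are given by (see, e.g., [7])

$$V^{\pi}(\boldsymbol{\omega}(1)) = \sum_{t=1}^{\infty} \beta^{t-1}\, \mathrm{E}^{\pi}\!\left[ R(\boldsymbol{\omega}(t), \mathcal{U}^{\pi}(t)) \mid \boldsymbol{\omega}(1) \right], \quad \text{and} \quad V^*(\boldsymbol{\omega}(1)) = \max_{\pi} V^{\pi}(\boldsymbol{\omega}(1)), \tag{17}$$

where the optimal policy is $\pi^* = \arg\max_{\pi} V^{\pi}(\boldsymbol{\omega}(1))$ and $0 \le \beta < 1$. From standard DP theory
[8], the optimal policy $\pi^*$ is stationary, so that the optimal decision $\mathcal{U}^*(t)$ is a function of the
current state $\boldsymbol{\omega}(t)$ only, independently of slot t [11]. By following the same reasoning as in [7,
Theorem 3], it can be shown that the optimality of the MP for the finite-horizon setting implies
the optimality also for the infinite-horizon scenario. Moreover, by following [7, Theorem 4] it
can be shown that the MP is optimal also for the undiscounted average reward criterion (i.e.,
$V^{\pi}(\boldsymbol{\omega}(1)) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathrm{E}^{\pi}[R(\boldsymbol{\omega}(t), \mathcal{U}^{\pi}(t)) \mid \boldsymbol{\omega}(1)]$).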

IV. Optimality of the Whittle Index Policy 

In this section, we briefly review the Whittle index policy for RMAB problems [6], and then
focus on the infinite-horizon scenario of Sec. III-C when conditions (1b) are specialized to

$$p_{11}^{(1)} = p_{10}^{(0)} = 0 \le p_{01}^{(1)} = p_{01}^{(0)} = p_{01} \le p_{11}^{(0)} = 1, \tag{18}$$

and where the task queues are of capacity one. We show that under the assumption (18) (see
Sec. I-B for a discussion on these conditions), the RMAB at hand is indexable and we calculate
its Whittle index in closed form. We then show that the Whittle index policy is equivalent to
the MP, and thus optimal for the problem (17).

We emphasize that our results provide a rare example [5] in which, as in [9], not only is
indexability established, but also the Whittle index is obtained in closed form and the Whittle
policy is proved to be optimal. It is finally remarked that our proof technique is inspired by [9],
but the different system model poses new challenges that require significant work.

A. Whittle Index 

The Whittle index policy assigns a numerical value $W(\omega_i)$ to each state $\omega_i$ of node $U_i$, referred
to as index, to measure how rewarding it is to schedule node $U_i$ in the current slot. The K nodes
with the largest index are then scheduled in each slot. As detailed below, the Whittle index
is calculated independently for each node, and thus the Whittle index policy is not generally
optimal for RMAB problems. Moreover, even the existence of a well-defined Whittle index is
not guaranteed [6]. To study the indexability and the Whittle index for the RMAB at hand, we
can focus on a restless single-armed bandit (RSAB) model, as defined below [5]. A RSAB is
a RMAB with a single arm, in which the only decision that needs to be taken by the CC is
whether to activate the (single) arm or not (i.e., keep it passive).

1) RSAB with Subsidy for Passivity: The Whittle index is based on the concept of subsidy
for passivity, whereby the CC is given a subsidy $m \in \mathbb{R}$ when the arm is not scheduled. At each
slot t, the CC, based on the state $\omega(t)$ of the arm, can decide to activate (or schedule) it, i.e.,
to set $u(t) = 1$, obtaining an immediate reward $R_m(\omega(t), 1) = \omega(t)$. If, instead, the arm is kept
passive, i.e., $u(t) = 0$, a reward $R_m(\omega(t), 0) = m$ equal to the subsidy is accrued. The state $\omega(t)$
evolves through (7), which under (18) and adapted to the simplified notation used here becomes

$$\omega(t+1) =
\begin{cases}
0 & \text{w.p. } \omega(t) & \text{if } u(t) = 1 \\
p_{01} & \text{w.p. } (1 - \omega(t)) & \text{if } u(t) = 1 \\
\tau_0^{(1)}(\omega(t)) & \text{w.p. } 1 & \text{if } u(t) = 0
\end{cases}. \tag{19}$$
The throughput, given policy $\pi = \{u^{\pi}(1), u^{\pi}(2), \ldots\}$ and initial belief $\omega(1)$, is

$$V_m^{\pi}(\omega(1)) = \sum_{t=1}^{\infty} \beta^{t-1}\, \mathrm{E}^{\pi}\!\left[ R_m(\omega(t), u^{\pi}(t)) \mid \omega(1) \right]. \tag{20}$$

The optimal throughput is $V_m^*(\omega(1)) = \max_{\pi} V_m^{\pi}(\omega(1))$, while the optimal policy
$\pi^* = \arg\max_{\pi} V_m^{\pi}(\omega(1))$ is stationary in the sense that the optimal decisions $u_m^*(\omega) \in \{0, 1\}$
are functions of the belief $\omega$ only, independently of slot t. Removing the slot index from the
initial belief, the optimal throughput $V_m^*(\omega)$ and the optimal decision $u_m^*(\omega)$ satisfy the following
DP optimality equations for the infinite-horizon scenario (see [9])

$$V_m^*(\omega) = \max_{u \in \{0, 1\}} \{ V_m(\omega|u) \}, \tag{21}$$

$$\text{and} \quad u_m^*(\omega) = \arg\max_{u \in \{0, 1\}} \{ V_m(\omega|u) \}. \tag{22}$$

In (21)-(22) we defined $V_m(\omega|u)$, $u \in \{0, 1\}$, as the throughput (20) of a policy that takes action
u at the current slot and then uses the optimal policy $u_m^*(\omega)$ onward. We have

$$V_m(\omega|0) = m + \beta V_m^*(\tau_0^{(1)}(\omega)), \quad \text{and} \tag{23}$$

$$V_m(\omega|1) = \omega + \beta \left[ \omega V_m^*(0) + (1 - \omega) V_m^*(p_{01}) \right]. \tag{24}$$

2) Indexability and Whittle Index: We use the notation of [9] to define indexability and Whittle
index for the RSAB at hand. We first define the so-called passive set

$$\mathcal{P}(m) = \{ \omega : 0 \le \omega \le 1 \text{ and } u_m^*(\omega) = 0 \}, \tag{25}$$

as the set that contains all the beliefs $\omega$ for which the passive action is optimal (i.e., all $0 \le \omega \le 1$
such that $V_m(\omega|0) \ge V_m(\omega|1)$, see (23)-(24)) under the given subsidy for passivity $m \in \mathbb{R}$.
The RMAB at hand is said to be indexable if the passive set $\mathcal{P}(m)$, for the associated RSAB
problem$^1$, is monotonically increasing as m increases within the interval $(-\infty, +\infty)$, in the
sense that $\mathcal{P}(m') \subseteq \mathcal{P}(m)$ if $m' \le m$ and $\mathcal{P}(-\infty) = \emptyset$ and $\mathcal{P}(+\infty) = [0, 1]$.

$^1$Note that in a RMAB with arms characterized by different statistics this condition must be checked for all arms.

If the RMAB is indexable, the Whittle index $W(\omega)$ for each arm with state $\omega$ is the infimum
subsidy m such that it is optimal to make the arm passive. Equivalently, the Whittle index $W(\omega)$
is the infimum subsidy m that makes passive and active actions equally rewarding, i.e.,

$$W(\omega) = \inf \{ m : u_m^*(\omega) = 0 \} = \inf \{ m : V_m(\omega|0) = V_m(\omega|1) \}. \tag{26}$$
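
The definition of the index can be explored numerically before any closed form is derived. The sketch below assumes the special case in which a passive arm never loses its task and an active arm moves to belief 0 or $p_{01}$, so that every reachable belief is of the form $b_k = 1 - (1 - p_{01})^k$. It runs value iteration on a truncated chain of such beliefs and locates the index by bisection on the subsidy. The truncation length, tolerances, bracket $[-1, 2]$, and function names are all assumptions of this example.

```python
def rsab_value_iteration(m, p01, beta, n_states=40, tol=1e-12):
    """Value iteration for the single-arm problem with subsidy m on beliefs b_0..b_n.

    b_k = 1 - (1 - p01)**k, so b_0 = 0, b_1 = p01, and a passive slot maps b_k to b_{k+1}.
    Truncation (assumption): the last state stays in place under passivity.
    """
    beliefs = [1.0 - (1.0 - p01) ** k for k in range(n_states + 1)]
    V = [0.0] * (n_states + 1)
    while True:
        updated = []
        for k, w in enumerate(beliefs):
            passive = m + beta * V[min(k + 1, n_states)]           # keep the arm passive
            active = w + beta * (w * V[0] + (1.0 - w) * V[1])       # activate the arm
            updated.append(max(passive, active))
        gap = max(abs(a - b) for a, b in zip(updated, V))
        V = updated
        if gap < tol:
            return beliefs, V


def whittle_index(k, p01, beta, n_states=40, iterations=40):
    """Bisection for the smallest subsidy that makes passivity optimal at belief b_k."""
    lo, hi = -1.0, 2.0
    for _ in range(iterations):
        m = 0.5 * (lo + hi)
        beliefs, V = rsab_value_iteration(m, p01, beta, n_states)
        w = beliefs[k]
        passive = m + beta * V[min(k + 1, n_states)]
        active = w + beta * (w * V[0] + (1.0 - w) * V[1])
        if passive >= active:
            hi = m   # passive is optimal: the index is at most m
        else:
            lo = m   # active is strictly better: the index exceeds m
    return 0.5 * (lo + hi)
```

Evaluating `whittle_index(k, p01=0.3, beta=0.8)` for increasing k gives a numerical view of how the index grows with the belief, which is the property that links an index policy to a rule that schedules the largest beliefs.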

B. Optimality of the Threshold Policy 

Here, we show that the optimal policy $u_m^*(\omega)$ for the RSAB of Sec. IV-A1 is a threshold
policy over the belief $\omega$. This is crucial in our proof of indexability of the RMAB at hand given
in Sec. IV-D. To this end, we observe that: i) function $V_m(\omega|1)$ in (24) is linear over the belief
$\omega$; ii) function $V_m(\omega|0) = m + \beta V_m^*(\tau_0^{(1)}(\omega))$ in (23) is convex over $\omega$, since the value function
$V_m^*(\omega)$ is convex for the problem at hand (see [9], [10]). We need the following lemma.

Lemma 4. The following inequalities hold: 

a) For^<m<l: a.ij K^(0|1) < V;„(0|0) < a.2j K^(1|0) < (27a) 

b) Form<{]: Z7.ij Kn(0|0) < V;,(0|1) < Kn(l|l); fe2j Kn(l|0) < F^(l|l); (27b) 

c) Form>l: c.l) VUO\0) < Kn(l|l) < Kn(0|l); c.2) Kn(l|l) < Kn(l|0). (27c) 

Proof: See Appendix B. ■
Leveraging Lemma 4, we can now establish the optimality of a threshold policy $u_m^*(\omega)$.

Proposition 5. The optimal policy $u_m^*(\omega)$ in (22) for subsidy $m \in \mathbb{R}$ is given by

$$u_m^*(\omega) =
\begin{cases}
1, & \text{if } \omega > \omega^*(m) \\
0, & \text{if } \omega \le \omega^*(m)
\end{cases}, \tag{28}$$

where $\omega^*(m) \in \mathbb{R}$ is the optimal threshold for a given subsidy m. The optimal threshold $\omega^*(m)$
is $0 \le \omega^*(m) < 1$ if $0 \le m < 1$, while it is arbitrarily negative for $m < 0$ and arbitrarily greater
than unity for $m \ge 1$. In other words we have $u_m^*(\omega) = 1$ if $m < 0$ and $u_m^*(\omega) = 0$ if $m \ge 1$.

Proof: We start by showing that (28), for $0 \le m < 1$, satisfies (22) and is thus an optimal
policy. To see this, we refer to Fig. 3, where we sketch functions $V_m(\omega|1)$ and $V_m(\omega|0)$ for
different values of the subsidy m. From (22), we have that $u_m^*(\omega) = 1$ for all $\omega$ such that
$V_m(\omega|1) > V_m(\omega|0)$ and $u_m^*(\omega) = 0$ otherwise. For $0 \le m < 1$, from the inequalities of
Lemma 4-a), the linearity of $V_m(\omega|1)$ and the convexity of $V_m(\omega|0)$, it follows that there is
only one intersection $\omega^*(m)$ between $V_m(\omega|1)$ and $V_m(\omega|0)$ with $0 \le \omega^*(m) < 1$, as shown
in Fig. 3-a). Instead, when $m < 0$, by Lemma 4-b), arm activation is always optimal, that is,
$u_m^*(\omega) = 1$, since $V_m(\omega|1) > V_m(\omega|0)$ for any $0 \le \omega \le 1$ as shown in Fig. 3-b). Conversely,
when $m \ge 1$, by Lemma 4-c), it follows that passivity is always optimal, that is, $u_m^*(\omega) = 0$,
since $V_m(\omega|0) \ge V_m(\omega|1)$ for any $0 \le \omega \le 1$ as shown in Fig. 3-c). ■

Figure 3. Illustration of the optimality of a threshold policy for different values of the subsidy for passivity m: a) $0 \le m < 1$;
b) $m < 0$; c) $m \ge 1$.

C. Closed-Form Expression of the Value Function 

By leveraging the optimality of the threshold policy (28) we derive a closed-form expression
of $V_m^*(\omega)$ in (21), which is a key step in establishing the RMAB's indexability in Sec. IV-D.
Notice that function $\tau_0^{(k)}(\omega)$ in (9), when specialized to conditions (18), becomes

$$\tau_0^{(k)}(\omega) = 1 - (1 - p_{01})^k (1 - \omega), \tag{29}$$

which is a monotonically increasing function of k, so that $\tau_0^{(k)}(\omega) \ge \tau_0^{(i)}(\omega)$ for any $k \ge i$.
Based on such monotonicity, we can define the number $L(\omega, \omega')$ of slots it takes for the
belief to become larger than $\omega'$ when starting from $\omega$ while the arm is kept passive, as

$$L(\omega, \omega') = \min \left\{ k : \tau_0^{(k)}(\omega) > \omega' \right\} =
\begin{cases}
0 & \omega > \omega' \\
\left\lfloor \dfrac{\ln\left( \frac{1 - \omega'}{1 - \omega} \right)}{\ln(1 - p_{01})} \right\rfloor + 1 & \omega \le \omega' \\
\infty & \omega \le 1 < \omega'
\end{cases}. \tag{30}$$

From (30) we have $L(\omega, \omega') = 1$ for $\omega = \omega'$ since, without loss of optimality, we assumed that
the passive action is optimal (i.e., $u_m^*(\omega) = 0$) when $V_m(\omega|0) = V_m(\omega|1)$. For $\omega' > 1$ instead
(according to Proposition 5), the arm is always kept passive and thus $L(\omega, \omega') = \infty$.
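
The passivity time can be computed either from the logarithmic expression or by iterating the passive belief update directly. The sketch below does both so that the two can be compared; the function names are hypothetical, and the example assumes $0 < p_{01} < 1$ and treats any target $\omega' \ge 1$ as unreachable.

```python
import math


def passive_slots(omega, omega_target, p01):
    """Closed-form number of passive slots until the belief exceeds omega_target."""
    if omega > omega_target:
        return 0
    if omega_target >= 1.0:
        return math.inf  # a belief never exceeds 1, so the target is never passed
    x = math.log((1.0 - omega_target) / (1.0 - omega)) / math.log(1.0 - p01)
    return math.floor(x) + 1


def passive_slots_bruteforce(omega, omega_target, p01, max_k=10_000):
    """Same quantity obtained by iterating the passive update 1 - (1 - p01)(1 - omega)."""
    k = 0
    while omega <= omega_target:
        omega = 1.0 - (1.0 - p01) * (1.0 - omega)
        k += 1
        if k > max_k:
            return math.inf
    return k
```

Starting from an empty queue with $p_{01} = 0.5$, the belief sequence is 0, 0.5, 0.75, so two passive slots are needed to exceed a target of 0.6; both functions return 2 in that case.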

Lemma 6. The optimal throughput $V_m^*(\omega)$ in (21) can be written as

$$V_m^*(\omega) = \frac{1 - \beta^{L(\omega, \omega^*(m))}}{1 - \beta}\, m + \beta^{L(\omega, \omega^*(m))}\, V_m\!\left( \tau_0^{(L(\omega, \omega^*(m)))}(\omega) \,\middle|\, 1 \right), \tag{31}$$

where $\omega^*(m)$ is the optimal threshold obtained from Proposition 5.

Proof: According to Proposition 5, the optimal policy $u_m^*(\omega)$ keeps the arm passive as long
as the current belief is $\omega \le \omega^*(m)$. Therefore, the arm is kept passive for $L(\omega, \omega^*(m))$ slots,
during which a reward $R_m(\omega, 0) = m$ is accrued in each slot. This leads to a total reward within
the passivity time given by the following geometric series
$\sum_{k=0}^{L(\omega, \omega^*(m)) - 1} \beta^k m = \frac{1 - \beta^{L(\omega, \omega^*(m))}}{1 - \beta}\, m$,
which corresponds to the first term in the RHS of (31). After $L(\omega, \omega^*(m))$ slots of passivity, the
belief becomes larger than the threshold $\omega^*(m)$ and the arm is activated. The contribution to the
value function $V_m^*(\omega)$ thus becomes $\beta^{L(\omega, \omega^*(m))} V_m(\tau_0^{(L(\omega, \omega^*(m)))}(\omega)|1)$, which is the second term
in the RHS of (31). Note that, when $\omega > \omega^*(m)$, activation is optimal, and $V_m^*(\omega) = V_m(\omega|1)$. ■
To evaluate $V_m^*(\omega)$ from (31), we only need to calculate $V_m(\omega|1)$ since the other terms, thanks
to (30), are explicitly given once $\omega^*(m)$ is obtained from Proposition 5. However, from (24),
evaluating $V_m(\omega|1)$ only requires $V_m^*(0)$ and $V_m^*(p_{01})$, which are calculated in the lemma below.
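Lemma 7. We have

$$V_m^*(0) = \frac{m - 2m\beta^{L_m^*} + \beta^{L_m^*}\tau_m^* - \beta^{L_m^*+1}\tau_m^* + m\beta^{L_m^*+1} + m\beta^{L_m^*}\tau_m^* - m\beta^{L_m^*+1}\tau_m^*}{(\beta - 1)\left(\beta^{L_m^*+1}\tau_m^* - \beta^{L_m^*}\tau_m^* + \beta^{L_m^*} - 1\right)}, \qquad (32a)$$

$$V_m^*(p_{01}) = \frac{m - m\beta^{L_m^*-1} + \beta^{L_m^*-1}\tau_m^* - \beta^{L_m^*}\tau_m^* + m\beta^{L_m^*}\tau_m^* - m\beta^{L_m^*+1}\tau_m^*}{(\beta - 1)\left(\beta^{L_m^*+1}\tau_m^* - \beta^{L_m^*}\tau_m^* + \beta^{L_m^*} - 1\right)}, \qquad (32b)$$

where we have defined $L_m^* = L(0, \omega^*(m))$ and $\tau_m^* = \tau_0^{(L(0, \omega^*(m)))}(0)$.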

Proof: By plugging (24) into (31), and evaluating (31) for $\omega = 0$ and $\omega = p_{01}$, we get a linear system in the two unknowns $V_m^*(0)$ and $V_m^*(p_{01})$, which can be solved leading to (32). ■
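The linear system mentioned in the proof can also be solved numerically. The Python sketch below does so under stated assumptions: the threshold $\omega^*(m)$ is passed in as a known number (`threshold`, a hypothetical argument, since Proposition 5 is not reproduced here), $0 \le$ `threshold` $< 1$, $0 < p_{01} < 1$, and the belief $p_{01}$ needs one passive slot fewer than the belief 0 to cross the threshold, because $\tau_0^{(1)}(0) = p_{01}$.

```python
import math


def value_at_zero_and_p01(m, beta, p01, threshold):
    """Solve the 2x2 system from (24) and (31) for V*(0) and V*(p01)."""
    # L* = L(0, threshold) from (30); tau is the belief reached from 0 after L* passive slots.
    L = math.floor(math.log(1.0 - threshold) / math.log(1.0 - p01)) + 1
    tau = 1.0 - (1.0 - p01) ** L
    # For n passive slots before activation:
    #   V = m (1 - b^n)/(1 - b) + b^n [tau + b tau V0 + b (1 - tau) Vp]
    # with n = L for V0 = V*(0) and n = L - 1 for Vp = V*(p01).
    rows = []
    for n in (L, L - 1):
        g = beta ** n
        rows.append((
            (1.0 if n == L else 0.0) - g * beta * tau,              # coefficient of V0
            (1.0 if n == L - 1 else 0.0) - g * beta * (1.0 - tau),  # coefficient of Vp
            m * (1.0 - g) / (1.0 - beta) + g * tau,                 # right-hand side
        ))
    (a, b, e), (c, d, f) = rows
    det = a * d - b * c
    # Cramer's rule for the two unknowns.
    return (e * d - b * f) / det, (a * f - e * c) / det
```

With $m = 6/17$, $\beta = p_{01} = 0.5$ and a threshold of 0.5, this returns $V_m^*(0) = 14/17$ and $V_m^*(p_{01}) = 16/17$, which also satisfy $V_m^*(0) = m + \beta V_m^*(p_{01})$.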



D. Indexability and Whittle Index

Here, we prove that the RMAB at hand is indexable, we derive the Whittle index in closed form and show that it is equivalent to the MP and thus optimal for the RMAB problem (11).

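Theorem 8. a) The RMAB at hand is indexable and b) its Whittle index is

$$W(\omega) = \frac{\left(1 - \beta^{L(0,\omega)}\right)\left[\omega\left(1 - \beta(1 - p_{01})\right) - \beta p_{01}\right] + (1 - \beta)\,\beta^{L(0,\omega)}\,\tau_0^{(L(0,\omega))}(0)}{(1 - \beta)\left\{1 + \beta^{L(0,\omega)}\left[(1 - \beta)\,\tau_0^{(L(0,\omega))}(0) + \beta p_{01} - \omega\left(1 - \beta(1 - p_{01})\right)\right]\right\}}. \qquad (33)$$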

Proof: Part a). See Appendix C. Part b). By definition, the Whittle index $W(\omega)$ of state $\omega$ is the value of the subsidy $m$ for which activating or not the arm is equally rewarding, so that $V_m(\omega|0) = V_m(\omega|1)$. By using (23)-(24), this becomes $\omega + \beta[\omega V_m^*(0) + (1 - \omega)V_m^*(p_{01})] = m + \beta V_m^*(\tau_0^{(1)}(\omega))$. Moreover, since the threshold policy is optimal and $\tau_0^{(1)}(\omega) > \omega$, it follows that, when the belief becomes $\tau_0^{(1)}(\omega)$, it is optimal to activate the arm and thus $V_m^*(\tau_0^{(1)}(\omega)) = V_m(\tau_0^{(1)}(\omega)|1) = \tau_0^{(1)}(\omega) + \beta\tau_0^{(1)}(\omega)V_m^*(0) + \beta(1 - \tau_0^{(1)}(\omega))V_m^*(p_{01})$. Plugging this result into $V_m(\omega|0) = V_m(\omega|1)$, along with (32a) and (32b), leads to (33), which concludes the proof. ■

It can be shown that the Whittle index $W(\omega)$ in (33) is an increasing function of $\omega$. Therefore, since the Whittle policy selects the $K$ arms with the largest index at each slot, we have:

Corollary 9. The Whittle index policy is equivalent to the MP and is thus optimal. 
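The indifference condition used in the proof of Theorem 8 can be evaluated numerically without expanding the closed form. The Python sketch below is one way to do this under stated assumptions: the threshold is set equal to the belief $\omega$ itself, $V_m^*(0) = m + \beta V_m^*(p_{01})$ is substituted (passivity is optimal at belief 0 for $0 \le m < 1$), and the resulting two linear equations in $V_m^*(p_{01})$ and $m$ are solved by Cramer's rule. The function name `whittle_index` is hypothetical, and $0 < p_{01} < 1$, $0 \le \beta < 1$ are assumed.

```python
import math


def whittle_index(omega, beta, p01):
    """Subsidy m for which passive and active actions are equally rewarding at belief omega."""
    if omega >= 1.0:
        return 1.0  # limiting case: L(0, 1) is infinite, see (30)
    L = math.floor(math.log(1.0 - omega) / math.log(1.0 - p01)) + 1  # L(0, omega)
    tau = 1.0 - (1.0 - p01) ** L               # belief at the first activation, starting from 0
    tau1 = 1.0 - (1.0 - p01) * (1.0 - omega)   # belief one passive slot after omega
    # Unknowns: B = V*(p01) and m, after substituting V*(0) = m + beta * B.
    # Row 1: recursion (31) evaluated at belief p01, which is L - 1 slots from activation.
    a11 = 1.0 - beta ** (L + 1) * tau - beta ** L * (1.0 - tau)
    a12 = -((1.0 - beta ** (L - 1)) / (1.0 - beta) + beta ** L * tau)
    b1 = beta ** (L - 1) * tau
    # Row 2: indifference condition V_m(omega|0) = V_m(omega|1).
    a21 = beta * (1.0 - beta) * (1.0 - omega + beta * tau1)
    a22 = -(1.0 - beta * omega + beta ** 2 * tau1)
    b2 = beta * tau1 - omega
    det = a11 * a22 - a12 * a21
    return (a11 * b2 - a21 * b1) / det
```

For $\beta = p_{01} = 0.5$ the index at $\omega = 0.5$ evaluates to $6/17$, and on a grid of beliefs the returned values are non-decreasing in $\omega$, in line with the monotonicity used for Corollary 9.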
V. Extension to Task Queues of Arbitrary Capacity C > 1

The problem of characterizing the optimal policies when C > 1 is significantly more compli- 
cated than for C = 1 and is left open by this work. Moreover, since the dimension of the state 
space of the belief MDP grows with C, even the numerical computation of the optimal policies 
is quite cumbersome. Due to these difficulties, here we compare the performance of the MP, 
inspired by its optimality for C = 1, with a performance upper bound obtained following the 
relaxation approach of jU. 




Figure 4. Markov model for the evolution of the queue $Q_i(t)$, of arbitrary capacity $C$, when node $U_i$: a) is not scheduled in slot $t$ (i.e., $U_i \notin \mathcal{U}(t)$); b) is scheduled in slot $t$ (i.e., $U_i \in \mathcal{U}(t)$).

A. System Model and Myopic Policy 

Each node $U_i$ has a task queue $Q_i(t) \in \{0, 1, \ldots, C\}$ of capacity $C$. We consider the Markov model of Fig. 4 for the task generation and expiration processes at each node (cf. Sec. I-A). The transition probabilities between queue states when node $U_i$ is not scheduled are $p^{(0)}_{xy} = \Pr[Q_i(t+1) = y \,|\, Q_i(t) = x, U_i \notin \mathcal{U}(t)]$, whereas when $U_i$ is scheduled we have $p^{(1)}_{xy} = \Pr[Q_i(t+1) = y \,|\, Q_i(t) = x, U_i \in \mathcal{U}(t)]$, for $x, y \in \{0, 1, \ldots, C\}$. When node $U_i$ is scheduled at slot $t$, and $Q_i(t) \ge 1$, one of its tasks is executed and it also informs the CC about the number of tasks left in the queue (observation). We assume that at most one task can be generated (or dropped) in a slot, so that $p^{(u)}_{xy} = 0$ for $y < x - 1$ and $y > x + 1$, with $u \in \{0, 1\}$, as shown in Fig. 4.

The belief of each $i$th node is represented by a $(C \times 1)$ vector $\omega_i = [\omega_{i,0}, \ldots, \omega_{i,C-1}]$ whose $k$th entry $\omega_{i,k}$, for $k \in \{0, 1, \ldots, C-1\}$, is given by $\omega_{i,k}(t) = \Pr[Q_i(t) = k \,|\, \mathcal{H}(t)]$. The immediate reward (5), given the initial belief vectors $\omega_1(t), \ldots, \omega_M(t)$ and action $\mathcal{U}$, becomes

$$R(\omega_1(t), \ldots, \omega_M(t), \mathcal{U}) = \sum_{i=1}^{M} \Pr[Q_i(t) > 0 \,|\, \mathcal{H}(t)]\, 1(U_i \in \mathcal{U}) = K - \sum_{i:\, U_i \in \mathcal{U}} \omega_{i,0}(t). \qquad (34)$$



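The performance of interest is the infinite-horizon throughput (11).

1) Myopic Policy: The MP (14), specialized to the immediate reward (34), becomes

$$\mathcal{U}^{MP}(t) = \arg\max_{\mathcal{U}} R(\omega_1(t), \ldots, \omega_M(t), \mathcal{U}) = \arg\min_{\mathcal{U}} \sum_{i:\, U_i \in \mathcal{U}} \omega_{i,0}(t). \qquad (35)$$

Note that, unlike Sec. III-A, when $C > 1$ the MP does not generally have a RR structure.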
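Since the reward (34) counts the scheduled nodes that are expected to have a non-empty queue, the myopic choice amounts to picking the $K$ nodes that are least likely to be empty. A minimal Python sketch follows, in which `beliefs[i][0]` is assumed to store $\omega_{i,0}(t)$ and both function names are hypothetical; ties are broken by node index, which the text does not specify.

```python
def myopic_schedule(beliefs, K):
    """Indices of the K nodes with the smallest probability of an empty queue."""
    order = sorted(range(len(beliefs)), key=lambda i: beliefs[i][0])
    return sorted(order[:K])


def expected_reward(beliefs, scheduled):
    """Immediate reward (34): number of scheduled nodes minus their empty-queue probabilities."""
    return len(scheduled) - sum(beliefs[i][0] for i in scheduled)
```

For example, with four nodes whose empty-queue probabilities are 0.5, 0.1, 0.3 and 0.8 and $K = 2$, nodes 1 and 2 are selected and the expected reward is 1.6.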
B. Upper Bound 

Here we derive an upper bound to the throughput (11) by following the relaxation approach for general RMAB problems mentioned above. The upper bound relaxes the constraint that exactly $K$ nodes must be scheduled in each slot. Specifically, it allows a variable number $K^{\pi}(t)$ of scheduled nodes in each $t$th slot under policy $\pi$, with the only constraint that its discounted average satisfies

$$E^{\pi}\left[\sum_{t=1}^{\infty} \beta^{t-1} K^{\pi}(t)\right] = \frac{K}{1 - \beta}. \qquad (36)$$

The advantage of this relaxed version of the scheduling problem is that it can be tackled by focusing on each single arm independently from the others [11], [12]. This is because, by the symmetry of the nodes, the constraint (36) can be equivalently handled by imposing that each node is active on average for a discounted time $E^{\pi}[\sum_{t=1}^{\infty} \beta^{t-1} 1(U_i \in \mathcal{U}^{\pi}(t))] = \frac{K}{M(1-\beta)}$. We can thus calculate the optimal solution of the relaxed problem by solving a single RSAB problem.

We now elaborate on such a RSAB by dropping the node index. Here, the immediate reward when the arm is in state $\omega$ (a vector since $C > 1$, see Sec. V-A), and action $u \in \{0, 1\}$ is chosen, is $R(\omega, u) = 1 - \omega_0$ if $u = 1$ and $R(\omega, u) = 0$ if $u = 0$, while the Markov evolution of the belief follows from Fig. 4 and similarly to Sec. I-A. The problem consists in optimizing the throughput

$$E^{\pi}\left[\sum_{t=1}^{\infty} \beta^{t-1} R(\omega(t), u(t))\right]$$

under the constraint $E^{\pi}[\sum_{t=1}^{\infty} \beta^{t-1} 1(U \in \mathcal{U}^{\pi}(t))] = \frac{1}{M}\sum_{t=1}^{\infty} \beta^{t-1} E^{\pi}[K^{\pi}(t)] = K/(M(1-\beta))$, as



introduced above. Under the assumption that the state $\omega$ belongs to a finite state space $\mathcal{W}$ (to be discussed below), this optimization can be done by resorting to a linear programming (LP) formulation [12]. Specifically, let $z_{\omega}^{(u)}$ be the probability of being in state $\omega$ and selecting action $u \in \{0, 1\}$ under a given policy. The optimization at hand leads to the following LP

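$$\text{maximize} \quad \sum_{\omega \in \mathcal{W}} \sum_{u \in \{0,1\}} R(\omega, u)\, z_{\omega}^{(u)} \qquad (37a)$$

$$\text{subject to:} \quad \sum_{\omega, u} z_{\omega}^{(u)} = 1, \qquad (37b)$$

$$\sum_{\omega} z_{\omega}^{(1)} = \frac{K}{M}, \qquad (37c)$$

$$z_{\omega}^{(0)} + z_{\omega}^{(1)} = (1 - \beta)\,\delta(\omega - \omega(1)) + \beta \sum_{\omega', u} z_{\omega'}^{(u)}\, p^{(u)}_{\omega'\omega}, \quad \text{for all } \omega \in \mathcal{W}, \qquad (37d)$$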

where (37c) is the constraint on the average time in which the node is scheduled, while (37d) guarantees that $z_{\omega}^{(u)}$ is the stationary distribution [12], in which $\delta(\omega - \omega(1)) = 1$ if $\omega = \omega(1)$ and $\delta(\omega - \omega(1)) = 0$ if $\omega \ne \omega(1)$. Note that, as discussed in Sec. II, the term $p^{(u)}_{\omega\omega'}$ is the probability that the next state is $\omega'$ given that action $u$ is taken in state $\omega$.

We are left to discuss the cardinality of the set $\mathcal{W}$. While the belief $\omega$ can generally assume any value in the $C$-dimensional probability simplex, the number of states actually assumed by $\omega$ during any limited horizon of time is finite due to the finiteness of the action space [10]. In our problem, since the time horizon is unlimited, this fact alone is not sufficient to conclude that the set $\mathcal{W}$ is finite. However, after each $t$th slot in which the arm is activated, the belief at the $(t+1)$th slot can only take $C$ values given that the queue state is learned by the CC. Therefore, the evolution of the belief is reset after each activation, and in practice, the time between two activations is finite since the node must be kept active for a discounted fraction of time $K/(M(1-\beta))$. Hence, by constraining the maximum time interval between two activations to a sufficiently large value, the state space $\mathcal{W}$ remains finite and the optimal performance is not affected. We used this approach for the numerical evaluation of the upper bound in Sec. V-C.
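The truncation argument above can be illustrated with a short Python sketch that enumerates the beliefs visited when the arm is left passive for at most `max_gap` slots after a reset. Several elements are assumptions made only for illustration: the belief is stored as a full distribution over the $C + 1$ queue states (rather than the $C$-entry vector used in the text), the reset beliefs are passed in by the caller, and the passive transition matrix `P0` for $C = 2$ contains sample values.

```python
def passive_step(belief, P0):
    """One passive slot: multiply the row vector `belief` by the passive transition matrix P0."""
    n = len(belief)
    return [sum(belief[x] * P0[x][y] for x in range(n)) for y in range(n)]


def reachable_beliefs(P0, reset_beliefs, max_gap):
    """Beliefs visited within `max_gap` passive slots after each reset belief."""
    states = {}
    for start in reset_beliefs:
        belief = list(start)
        for _ in range(max_gap + 1):
            # Round the key so that numerically identical beliefs are stored once.
            key = tuple(round(b, 12) for b in belief)
            states.setdefault(key, belief)
            belief = passive_step(belief, P0)
    return list(states.values())


# Illustrative passive chain for C = 2 (each row sums to one); the values are assumptions.
P0 = [[0.85, 0.15, 0.00],
      [0.05, 0.85, 0.10],
      [0.00, 0.10, 0.90]]
W = reachable_beliefs(P0, reset_beliefs=[[1, 0, 0], [0, 1, 0], [0, 0, 1]], max_gap=5)
```

The size of the resulting state set is bounded by the number of reset beliefs times `max_gap + 1`, which is what keeps the LP finite.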
C. Numerical Results

We now present some numerical results to compare the performance of the MP with the upper bound of Sec. V-B. The performance is the throughput (11) normalized by its ideal value $K/(1-\beta)$, which is obtained if the nodes always have a task to be completed when scheduled.

In Fig. 5 we show the normalized throughput versus the queue capacity $C$ for different ratios $M/K$ between the number $M$ of nodes and the number $K$ of nodes scheduled in each slot. We keep $K = 3$ fixed and vary $M$. We assume a uniform distribution for the initial number of tasks in the queues $Q_i(1)$ for all the nodes, so that $\omega_{i,k}(1) = 1/(C+1)$ for all $i, k$. The probabilities that a new task is generated when the arm is kept passive are $p^{(0)}_{01} = 0.15$ and $p^{(0)}_{k,k+1} = 0.1$, for $k \in \{1, \ldots, C-1\}$, while under activation they are $p^{(1)}_{01} = 0.05$ and $p^{(1)}_{k,k+1} = 0$. The probabilities that a task expires when the arm is kept passive and activated are $p^{(0)}_{k,k-1} = 0.05$ and $p^{(1)}_{k,k-1} = 0.95$, respectively. The remaining transition probabilities are $p^{(0)}_{CC} = 0.9$, $p^{(1)}_{CC} = 0.05$, while $\beta = 0.95$.

From Fig. 5 it can be seen that when $C$ and/or $M/K$ are small the MP's performance is close to the upper bound. In fact, for small $M/K$, most of the nodes are scheduled in each slot and the relaxed system in Sec. V-B approaches the original one, while for small $C$ we get closer to the optimality of the MP for $C = 1$. For moderate to large values of $M/K$ and/or $C$ instead, the greater flexibility of the relaxed system enables larger gains over the MP.

Figure 5. Normalized throughput of the MP in (35) as compared to the upper bound versus the queue capacity $C$ for different ratios $M/K \in \{1, 3, 10\}$ (system parameters are $K = 3$, $\beta = 0.95$, $\omega_{i,k}(1) = 1/(C+1)$ for all $i, k$, $p^{(0)}_{01} = 0.15$, $p^{(1)}_{01} = 0.05$, $p^{(0)}_{CC} = 0.9$, $p^{(1)}_{CC} = 0.05$, $p^{(0)}_{k,k-1} = 0.05$, $p^{(1)}_{k,k-1} = 0.95$, $p^{(0)}_{k,k+1} = 0.1$, $p^{(1)}_{k,k+1} = 0$, for $k \in \{1, \ldots, C-1\}$).

VI. Conclusions

This paper considers a centralized scheduling problem for independent, symmetric and time-sensitive tasks under resource constraints. The problem is to assign a finite number of resources to a larger number of nodes that may have tasks to be completed in their task queue. It is assumed that the central controller has no direct access to the queue of each node. Based on a Markovian modeling of the task generation and expiration processes, the scheduling problem is formulated as a partially observable Markov decision process (POMDP) and then cast into the framework of restless multi-armed bandit (RMAB) problems. Under the assumption that the task queues are of capacity one, a greedy, or myopic policy (MP), operating in the space of the a posteriori probabilities (beliefs) of the number of tasks in the queues, is proved to be optimal, under appropriate assumptions, for both finite and infinite-horizon throughput criteria. The MP selects at each slot the nodes with the largest probability of having a task to be completed. It is shown that the MP is round-robin since it schedules the nodes periodically. We have also established that the RMAB problem at hand is indexable, derived the Whittle index in closed form and shown that the Whittle index policy is equivalent to the MP and thus optimal.

Systems in which the task queues have arbitrary capacities have been investigated as well 
by comparing the performance of the MP, which is generally suboptimal, with an upper bound 
based on a relaxation of the scheduling constraint. 

Overall, this paper proposes a general framework for resource allocation that finds applications 
in several areas of current interest including communication networks and distributed computing. 

Appendix A 
Proof of Theorem 3

The proof is divided into two steps. In the first step we derive the throughput of the RR policy in closed form, and then we show that inequality (16) holds.
As for the first step, the throughput for the RR policy (and thus of the MP) can be calculated as the sum of the contribution of each node separately (due to the round-robin structure). To elaborate, let us focus on node $U_i$, with initial belief $\omega_i(1)$, and assume that $U_i \in \mathcal{G}_1$. Nodes in group $\mathcal{G}_1$ are scheduled at slots $t \in \{1 + (j-1)m\}$, for $j \in \{1, 2, \ldots\}$. Let $r_j(\omega_i(1)) = E^{RR}[\omega_i(1 + (j-1)m) \,|\, \omega_i(1)]$ be the average reward accrued by the CC from node $U_i$ only, when scheduling it for the $j$th time at slot $t = 1 + (j-1)m$ (i.e., when operating the RR policy). At slot $t = 1$ we have $r_1(\omega_i(1)) = \omega_i(1)$. To calculate $r_2(\omega_i(1))$ we first derive the average value of the belief after the slot of activity in $t = 1$ as $E^{RR}[\omega_i(2) \,|\, \omega_i(1)] = \tau_1^{(1)}(\omega_i(1))$, where $\tau_1^{(1)}(\omega) = \omega\delta_1 + p^{(1)}_{01}$ with $\delta_1 = p^{(1)}_{11} - p^{(1)}_{01}$. We then account for the $(m-1)$ slots of passivity by exploiting (8), so that $r_2(\omega_i(1)) = E^{RR}[\omega_i(1 + m) \,|\, \omega_i(1)] = \phi^{(1)}(\omega_i(1))$, where we have set $\phi^{(1)}(\omega) = \tau_0^{(m-1)}(\tau_1^{(1)}(\omega)) = \omega\alpha_m + \psi_m$ with $\alpha_m = \delta_1\delta_0^{m-1}$ and $\psi_m = p^{(1)}_{01}\delta_0^{m-1} + p^{(0)}_{01}\frac{1 - \delta_0^{m-1}}{1 - \delta_0}$, and where $\tau_0^{(k)}(\omega) = \tau_0^{(1)}(\tau_0^{(k-1)}(\omega))$ indicates the belief of a node after $k$ slots of passivity when the initial belief is $\omega$ (i.e., $\tau_0^{(k)}(\omega)$ is obtained recursively by applying $\tau_0^{(1)}(\omega)$ to itself $k$ times). In general, we can obtain $r_j(\omega_i(1)) = E^{RR}[\omega_i(1 + (j-1)m) \,|\, \omega_i(1)]$, for $j \ge 2$, by iterating the procedure above by applying $\phi^{(1)}(\omega)$ to itself $(j-1)$ times. After a little algebra we get $\phi^{(j-1)}(\omega) = \phi^{(1)}(\phi^{(j-2)}(\omega)) = \omega\alpha_m^{j-1} + \psi_m\frac{1 - \alpha_m^{j-1}}{1 - \alpha_m}$, so that $r_j(\omega_i(1)) = \phi^{(j-1)}(\omega_i(1))$, where we set $\phi^{(0)}(\omega) = \omega$. The reasoning above can be applied when starting from any arbitrary slot $t$.
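The closed form for the belief at the $j$th scheduling instant can be checked against a direct iteration of the one-slot belief updates. The Python sketch below assumes a two-state queue whose passive and active chains have transition probabilities `p01_0`, `p11_0` and `p01_1`, `p11_1` (hypothetical argument names), and a round-robin period of `m` slots.

```python
def tau_active(omega, p01_1, p11_1):
    """Belief after one active slot: omega * delta_1 + p01 under activation."""
    return omega * (p11_1 - p01_1) + p01_1


def tau_passive(omega, p01_0, p11_0, k=1):
    """Belief after k passive slots, applying the one-slot update k times."""
    for _ in range(k):
        omega = omega * (p11_0 - p01_0) + p01_0
    return omega


def phi_iterated(omega, j, m, p01_0, p11_0, p01_1, p11_1):
    """Belief at the (j+1)th scheduling instant: j rounds of one active and m-1 passive slots."""
    for _ in range(j):
        omega = tau_passive(tau_active(omega, p01_1, p11_1), p01_0, p11_0, k=m - 1)
    return omega


def phi_closed_form(omega, j, m, p01_0, p11_0, p01_1, p11_1):
    """Closed form omega * alpha^j + psi * (1 - alpha^j) / (1 - alpha)."""
    d0, d1 = p11_0 - p01_0, p11_1 - p01_1
    alpha = d1 * d0 ** (m - 1)
    psi = p01_1 * d0 ** (m - 1) + p01_0 * (1 - d0 ** (m - 1)) / (1 - d0)
    return omega * alpha ** j + psi * (1 - alpha ** j) / (1 - alpha)
```

Both functions agree up to floating-point error for any number of rounds, which is the "little algebra" step of the proof in executable form.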

Finally, the total reward accrued by the CC from a node that is scheduled $H$ times, when its belief at the first slot in which it is scheduled is $\omega$, can be calculated by summing up the average reward $r_j(\cdot)$ during each slot in which the node is scheduled (see definition above), as

$$\theta^{(H)}(\omega) = \sum_{j=1}^{H} \beta^{m(j-1)} r_j(\omega) = \left(\omega - \frac{\psi_m}{1 - \alpha_m}\right)\frac{1 - (\beta^m\alpha_m)^H}{1 - \beta^m\alpha_m} + \frac{\psi_m}{1 - \alpha_m}\,\frac{1 - \beta^{mH}}{1 - \beta^m}. \qquad (38)$$

Note that, for a node $U_i \in \mathcal{G}_g$, for $g > 1$ and with belief equal to $\omega$ at $t = 1$, the first slot in which the node is scheduled is $t = g$, and thus its belief at time $t = g$ becomes $\tau_0^{(g-1)}(\omega)$ (i.e., after $(g-1)$ slots of passivity while other groups are scheduled). Therefore, for a node $U_i \in \mathcal{G}_g$, with initial belief $\omega$, the total contribution to the throughput is given by $\beta^{g-1}\theta^{(H)}(\tau_0^{(g-1)}(\omega))$.
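The total reward collected from one node can be evaluated either by summing the discounted rewards slot by slot or through the two geometric series that appear in (38). The Python sketch below compares the two; the closed form is a reconstruction from the geometric sums and should be read as an assumption, as are the function names and the sample parameter values.

```python
def theta_direct(omega, H, beta, m, alpha, psi):
    """Sum of beta^(m j) times the belief at the (j+1)th scheduling instant, j = 0..H-1."""
    total = 0.0
    belief = omega
    for j in range(H):
        total += beta ** (m * j) * belief
        belief = alpha * belief + psi  # one round-robin period: phi^(1)
    return total


def theta_closed_form(omega, H, beta, m, alpha, psi):
    """Closed form obtained by splitting the sum into two geometric series."""
    fixed = psi / (1.0 - alpha)  # fixed point of the map phi^(1)
    bm = beta ** m
    return ((omega - fixed) * (1.0 - (bm * alpha) ** H) / (1.0 - bm * alpha)
            + fixed * (1.0 - bm ** H) / (1.0 - bm))
```

For a node in group $g > 1$, the same functions apply after replacing `omega` with the belief reached after $g - 1$ passive slots and multiplying the result by $\beta^{g-1}$.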
Let us now focus on the second step, i.e., proving the inequality (16). At $t = T$, it is easily seen to hold due to (3) and (12). We then need to show that (16) also holds at $t$. To do so, let us denote as $\mathcal{L}$ and $\mathcal{R}$ the RR policies whose throughputs are given by the LHS and RHS of (16), respectively. The differences between $\mathcal{L}$ and $\mathcal{R}$ are the positions of the nodes with belief $x$ and $y$ in the initial belief vectors. Therefore, some of the $m$ groups created by the two policies might have different nodes (see the RR operations in Proposition 1). To simplify, we refer to the node with belief $x$ ($y$) as node $x$ ($y$). Let us assume that nodes $x$ and $y$ belong to groups $\mathcal{G}_{g'}$ and $\mathcal{G}_{g''}$ under policy $\mathcal{R}$, respectively, while they belong to groups $\mathcal{G}_{g''}$ and $\mathcal{G}_{g'}$ under policy $\mathcal{L}$, respectively, with $g'' \ge g'$, and $g', g'' \in \{1, \ldots, m\}$. If $g'' = g'$, then the two policies coincide and (16) holds with equality. If $g'' = g' + 1$ (nodes are adjacent but do not belong to the same group), the only difference between policies $\mathcal{L}$ and $\mathcal{R}$ is the scheduling order of nodes $x$ and $y$.

To verify that inequality (16) holds, we need to prove that scheduling node $y$ in group $\mathcal{G}_{g'}$ and node $x$ in group $\mathcal{G}_{g''}$ is no better than doing the opposite for any $x \ge y$. To elaborate, let $H_x^{\mathcal{R}}(t) = H_y^{\mathcal{L}}(t)$ and $H_y^{\mathcal{R}}(t) = H_x^{\mathcal{L}}(t)$ be the number of times that node $x$ (or $y$) is scheduled under policy $\mathcal{R}$ (or $\mathcal{L}$) and node $y$ (or $x$) is scheduled under policy $\mathcal{R}$ (or $\mathcal{L}$) in the horizon $\{t, t+1, \ldots, T\}$, respectively. By recalling (38) and the discount factor $\beta$, the contributions generated by nodes $x$ and $y$ under policy $\mathcal{R}$ are $\beta^{g'-1}\theta^{(H_x^{\mathcal{R}}(t))}(\tau_0^{(g'-1)}(x))$ and $\beta^{g''-1}\theta^{(H_y^{\mathcal{R}}(t))}(\tau_0^{(g''-1)}(y))$, respectively, and similarly under policy $\mathcal{L}$ we have $\beta^{g''-1}\theta^{(H_x^{\mathcal{L}}(t))}(\tau_0^{(g''-1)}(x))$ and $\beta^{g'-1}\theta^{(H_y^{\mathcal{L}}(t))}(\tau_0^{(g'-1)}(y))$.

Note that, in the argument of function $\theta^{(H)}(\cdot)$, we have considered that the nodes in group $\mathcal{G}_{g'}$ are scheduled for the first time after $g' - 1$ slots, and thus the belief must be updated through function $\tau_0^{(g'-1)}(\cdot)$; similarly, nodes in $\mathcal{G}_{g''}$ are scheduled for the first time after $g'' - 1$ slots. Moreover, the discount factor $\beta^{g'-1}$ is common to all the nodes in group $\mathcal{G}_{g'}$, and so is $\beta^{g''-1}$ for group $\mathcal{G}_{g''}$.

By recalling that all the nodes, except $x$ and $y$, are scheduled at the same slot under the two policies $\mathcal{R}$ and $\mathcal{L}$ (thus giving the same contribution to the throughput), the inequality (16) can thus be reduced to

$$\beta^{g'-1}\theta^{(H_x^{\mathcal{R}}(t))}(\tau_0^{(g'-1)}(x)) + \beta^{g''-1}\theta^{(H_y^{\mathcal{R}}(t))}(\tau_0^{(g''-1)}(y)) - \beta^{g''-1}\theta^{(H_x^{\mathcal{L}}(t))}(\tau_0^{(g''-1)}(x)) - \beta^{g'-1}\theta^{(H_y^{\mathcal{L}}(t))}(\tau_0^{(g'-1)}(y)) \ge 0,$$

which must hold for all admissible $H_x^{\mathcal{R}}(t) = H_y^{\mathcal{L}}(t)$ and $H_y^{\mathcal{R}}(t) = H_x^{\mathcal{L}}(t)$ and all $g'' \ge g'$, with $g', g'' \in \{1, \ldots, m\}$. There are two cases: 1) $H_x^{\mathcal{R}}(t) = H_y^{\mathcal{L}}(t) = H_y^{\mathcal{R}}(t) = H_x^{\mathcal{L}}(t) = H \ge 1$, that is, nodes $x$ and $y$ are scheduled the same number of times within the horizon of interest under the two policies $\mathcal{R}$ and $\mathcal{L}$; 2) $H_x^{\mathcal{R}}(t) = H_y^{\mathcal{L}}(t) = H$, and $H_y^{\mathcal{R}}(t) = H_x^{\mathcal{L}}(t) = H - 1$, for $H \ge 1$, namely, node $x$ (or $y$) is scheduled one time more than node $y$ (or $x$) under policy $\mathcal{R}$ (or $\mathcal{L}$). By exploiting the RHS of (38), after a little algebra, one can verify that the inequality above holds in both cases, which concludes the proof of Theorem 3.

Appendix B 
Proof of Lemma 4

Proof of case a). From (23)-(24), and recalling that $\tau_0^{(1)}(0) = p_{01}$ from (15), the leftmost inequality in (27a.1) follows immediately as it becomes $V_m(0|1) = \beta V_m^*(p_{01}) \le m + \beta V_m^*(p_{01}) = V_m(0|0)$. For the rightmost inequality in (27a.1), we have $V_m(1|1) = 1 + \beta V_m^*(0)$, while from (21) and the fact that $V_m(0|1) \le V_m(0|0)$ we have $V_m^*(0) = \max\{V_m(0|0), V_m(0|1)\} = V_m(0|0)$. Therefore, we have $V_m(1|1) = 1 + \beta V_m^*(0) = 1 + \beta V_m(0|0) > V_m(0|0)$, which holds as $1 + \beta V_m(0|0) > V_m(0|0)$ implies $V_m(0|0) < \frac{1}{1-\beta}$. The latter bound always holds, since for $m < 1$ the infinite-horizon throughput is upper bounded as $V_m^*(\omega) < \sum_{t=0}^{\infty} \beta^t = \frac{1}{1-\beta}$, given that we can get at most a reward of $R_m(\omega, u) \le 1$ in each slot. Hence, inequalities (27a.1) are proved. Inequality (27a.2) can be proved by contradiction. Specifically, let us assume that: hp.1) $V_m(1|0) > V_m(1|1)$. From (21) we would have $V_m^*(1) = \max\{V_m(1|0), V_m(1|1)\} = V_m(1|0)$, i.e., the passive action would be optimal when $\omega = 1$. Moreover, from (23) we would have $V_m(1|0) = m + \beta V_m^*(1) = m + \beta V_m(1|0)$, which can be solved with respect to $V_m(1|0)$ to get $V_m(1|0) = \frac{m}{1-\beta} = V_m^*(1)$. Therefore, if hypothesis hp.1) holds, we also have that $V_m(1|1) = 1 + \beta V_m^*(0) < V_m(1|0) = V_m^*(1) = \frac{m}{1-\beta}$. However, the value function $V_m^*(\omega)$ is bounded as $\frac{m}{1-\beta} \le V_m^*(\omega) \le \frac{1}{1-\beta}$, where the lower bound is obtained considering a policy that always chooses the passive action for any belief $\omega$. The boundedness of the value function thus implies that if hp.1) holds then $1 + \beta\frac{m}{1-\beta} \le 1 + \beta V_m^*(0) = V_m(1|1) < V_m(1|0) = \frac{m}{1-\beta}$, which yields $1 + \beta\frac{m}{1-\beta} < \frac{m}{1-\beta}$ and thus $(1 - \beta)(1 - m) < 0$. But this is clearly impossible as $m, \beta < 1$. Consequently, we have proved that $V_m(1|1) \ge V_m(1|0)$.

Proof of case b). Inequality $V_m(0|0) < V_m(0|1)$ follows immediately since $m + \beta V_m^*(p_{01}) < \beta V_m^*(p_{01})$ holds for $m < 0$. The second inequality $V_m(0|1) < V_m(1|1)$ becomes $V_m(0|1) < 1 + \beta V_m^*(0) = 1 + \beta V_m(0|1)$, which leads to $V_m(0|1) < \frac{1}{1-\beta}$, which always holds as discussed above. Inequality $V_m(1|0) < V_m(1|1)$ holds since an active action is always optimal when $m < 0$.

Proof of case c). The inequality holds since a passive action is always optimal for any $m \ge 1$.

Appendix C 
Proof of Theorem 8

Following the discussion in Sec. IV-A2, to prove indexability it is sufficient to show that the threshold $\omega^*(m)$ is monotonically increasing with the subsidy $m$, for $0 \le m < 1$. In fact, from Proposition 5 the passive set (25) for $m < 0$ is $\mathcal{P}(m) = \emptyset$, while for $m \ge 1$, we have $\mathcal{P}(m) = [0, 1]$. We then only need to prove the monotonicity of $\omega^*(m)$ for $0 \le m < 1$, which has been shown to hold, in Lemma 9 of the cited reference, if

$$\frac{dV_m(\omega|1)}{dm}\bigg|_{\omega = \omega^*(m)} < \frac{dV_m(\omega|0)}{dm}\bigg|_{\omega = \omega^*(m)}. \qquad (39)$$

To check if (39) holds, we differentiate (23)-(24) at the optimal threshold $\omega = \omega^*(m)$ as



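$$V_m(\omega^*(m)|1) = \omega^*(m) + \beta\omega^*(m)V_m^*(0) + \beta(1 - \omega^*(m))V_m^*(p_{01}), \text{ and} \qquad (40)$$

$$V_m(\omega^*(m)|0) = m + \beta\tau_0^{(1)}(\omega^*(m))\left(1 + \beta V_m^*(0)\right) + \beta^2\left(1 - \tau_0^{(1)}(\omega^*(m))\right)V_m^*(p_{01}), \qquad (41)$$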



where (41) follows from (24) and from the fact that $\tau_0^{(1)}(\omega) \ge \omega$, for any $\omega$ (see (29)), and hence $V_m^*(\tau_0^{(1)}(\omega^*(m))) = V_m(\tau_0^{(1)}(\omega^*(m))|1)$, since arm activation is optimal for any $\omega > \omega^*(m)$. By letting $D_m(\omega) = \frac{dV_m^*(\omega)}{dm}$, then from (40) we have

$$\frac{dV_m(\omega|1)}{dm}\bigg|_{\omega = \omega^*(m)} = \beta\omega^*(m)D_m(0) + \beta(1 - \omega^*(m))D_m(p_{01}),$$

while from (41) we get

$$\frac{dV_m(\omega|0)}{dm}\bigg|_{\omega = \omega^*(m)} = 1 + \beta^2\tau_0^{(1)}(\omega^*)D_m(0) + \beta^2\left(1 - \tau_0^{(1)}(\omega^*)\right)D_m(p_{01}).$$

Finally, after some algebraic manipulations, and recalling that $D_m(0) = \frac{d(m + \beta V_m^*(p_{01}))}{dm} = 1 + \beta D_m(p_{01})$, we can rewrite (39), with $\omega = \omega^*(m)$, as $D_m(p_{01})\beta(1 - \beta)\left[1 + \beta p_{01} - \omega\left(1 - \beta(1 - p_{01})\right)\right] + \beta\left[\omega\left(1 - \beta(1 - p_{01})\right) - \beta p_{01}\right] < 1$. To show that the last inequality holds when $0 \le m < 1$, we first upper bound the derivative of the value function as $D_m(\omega) \le \frac{1}{1-\beta}$, since $\frac{d}{dm}R_m(\omega, u) \le 1$. Finally, using this upper bound $D_m(p_{01}) \le \frac{1}{1-\beta}$, whose coefficient is non-negative, after a little algebra (39) reduces to $\beta < 1$, which clearly holds for any $\beta \in [0, 1)$. This concludes the proof of Theorem 8.